Dimension independent similarity computation

Authors

  • Reza Bosagh Zadeh
  • Ashish Goel
Abstract

We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high-dimensional sparse vectors. All of our results are provably independent of dimension, meaning that apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension; thus the dimension can be very large. We study Cosine, Dice, Overlap, and the Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems with large scale experiments using data from the social networking site Twitter. At time of writing, our algorithms are live in production at twitter.com.
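For readers unfamiliar with the four measures named in the abstract, the sketch below computes them for two sparse binary vectors represented as Python sets of nonzero indices. It uses the common set-based definitions and is only an illustration: it is not the DISCO sampling scheme or a MapReduce implementation, and the function name `set_similarities` and the convention of returning 0.0 for an empty input are assumptions made here.

```python
from math import sqrt


def set_similarities(a: set, b: set) -> dict:
    """Cosine, Dice, Overlap and Jaccard similarity of two sets.

    Each set holds the indices of the nonzero entries of a binary vector.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if not a or not b:
        # Assumed convention: similarity with an empty vector is 0.
        return {"cosine": 0.0, "dice": 0.0, "overlap": 0.0, "jaccard": 0.0}

    inter = len(a & b)  # number of dimensions where both vectors are nonzero
    return {
        "cosine": inter / sqrt(len(a) * len(b)),
        "dice": 2 * inter / (len(a) + len(b)),
        "overlap": inter / min(len(a), len(b)),
        "jaccard": inter / len(a | b),
    }


if __name__ == "__main__":
    x = {1, 2, 3, 4}
    y = {3, 4, 5, 6}
    print(set_similarities(x, y))
```

Note that every quantity depends only on the number of nonzero entries and their overlap, never on the total dimension of the vectors.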

Similar resources

Incremental All Pairs Similarity Search for Varying Similarity Thresholds with Reduced I/O Overhead

All Pairs Similarity Search (APSS) is the problem of finding all pairs of records with similarity scores above a specified threshold. Incremental All Pairs Similarity Search (IAPSS) is the problem of performing APSS multiple times over the same dataset by varying the similarity threshold. This problem is ubiquitous in many real-world systems like search engines, online social networks, and digi...

Notes on quantitative structure-properties relationships (QSPR) (1): A discussion on a QSPR dimensionality paradox (QSPR DP) and its quantum resolution

Classical quantitative structure-properties relationship (QSPR) statistical techniques unavoidably present an inherent paradoxical computational context. They rely on the definition of a Gram matrix in descriptor spaces, which is used afterwards to reduce the original dimension via several possible kinds of algebraic manipulations. From there, effective models for the computation of unknown pro...

Expectations on fractal sets

Using fractal self-similarity and functional-expectation relations, the classical theory of box integrals—being expectations on unit hypercubes—is extended to a class of fractal “string-generated Cantor sets” (SCSs) embedded in unit hypercubes of arbitrary dimension. Motivated by laboratory studies on the distribution of brain synapses, these SCSs were designed for dimensional freedom—a suitabl...

WaldHash: sequential similarity-preserving hashing

Similarity-sensitive hashing seeks compact representation of vector data as binary codes, so that the Hamming distance between code words approximates the original similarity. In this paper, we show that using codes of fixed length is inherently inefficient as the similarity can often be approximated well using just a few bits. We formulate a sequential embedding problem and approach similarity...

Retrieving Images by 2D Shape: A Comparison of Computation Methods with Human Perceptual Judgments

In content based image retrieval, systems allow users to ask for objects similar in shape to a query object. However, there is no clear understanding of how computational shape similarity corresponds to human shape similarity. In this paper several shape similarity measures were evaluated on planar, connected, non-occluded binary shapes. Shape similarity using algebraic moments, spline curve di...

Journal:
  • Journal of Machine Learning Research

Volume 14

Publication date: 2013